Huey Kwik
## [1] 1072770 25
## Classes 'tbl_df', 'tbl' and 'data.frame': 1072770 obs. of 25 variables:
## $ cmte_id : chr "C00575795" "C00575795" "C00575795" "C00577130" ...
## $ cand_id : chr "P00003392" "P00003392" "P00003392" "P60007168" ...
## $ cand_nm : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 20 20 20 20 4 20 20 ...
## $ contbr_nm : chr "AULL, ANNE" "CARROLL, MARYJEAN" "GANDARA, DESIREE" "LEE, ALAN" ...
## $ contbr_city : chr "LARKSPUR" "CAMBRIA" "FONTANA" "CAMARILLO" ...
## $ contbr_st : chr "CA" "CA" "CA" "CA" ...
## $ contbr_zip : int 949391913 934284638 923371507 930111214 902784310 902784310 920842849 926372912 926833846 949522729 ...
## $ contbr_employer : chr "N/A" "N/A" "N/A" "AT&T GOVERNMENT SOLUTIONS" ...
## $ contbr_occupation: chr "RETIRED" "RETIRED" "RETIRED" "SOFTWARE ENGINEER" ...
## $ contb_receipt_amt: num 50 200 5 40 35 100 25 40 10 15 ...
## $ contb_receipt_dt : Date, format: "2016-04-26" "2016-04-20" ...
## $ receipt_desc : chr NA NA NA NA ...
## $ memo_cd : chr "X" "X" "X" NA ...
## $ memo_text : chr "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* EARMARKED CONTRIBUTION: SEE BELOW" ...
## $ form_tp : chr "SA18" "SA18" "SA18" "SA17A" ...
## $ file_num : int 1091718 1091718 1091718 1077404 1077404 1077404 1077404 1091718 1077404 1077404 ...
## $ tran_id : chr "C4768722" "C4747242" "C4666603" "VPF7BKWA097" ...
## $ election_tp : Factor w/ 3 levels "G2016","P2016",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ cand_last_name : Factor w/ 25 levels "Bush","Carson",..: 4 4 4 20 20 20 20 4 20 20 ...
## $ party : Ord.factor w/ 5 levels "Democratic"<"Republican"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ zip : chr "94939" "93428" "92337" "93011" ...
## $ city : chr "Larkspur" "Cambria" "Fontana" "Camarillo" ...
## $ state : chr "CA" "CA" "CA" "CA" ...
## $ latitude : num 37.9 35.6 34 34 33.9 ...
## $ longitude : num -123 -121 -117 -119 -118 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -10000.0 15.0 27.0 122.7 100.0 10800.0 1
## [1] "Number of negative contributions: 10617"
Observations:
There are negative values of contb_receipt_amt in the dataset.
## # A tibble: 8 × 5
## contbr_nm contb_receipt_dt contb_receipt_amt tran_id
## <chr> <date> <dbl> <chr>
## 1 HOROWITZ, DAVID 2015-07-06 10800 SA17A.123258
## 2 HOROWITZ, DAVID 2015-07-06 -5400 SA17A.123258.0
## 3 HOROWITZ, MICHELLE 2015-07-06 5400 SA17A.123258.1
## 4 HOROWITZ, MICHELLE 2015-07-06 -2700 SA17A.123258.2
## 5 HOROWITZ, MICHELLE 2015-07-06 2700 SA17A.123258.3
## 6 HOROWITZ, DAVID 2015-07-06 -2700 SA17A.123258.4
## 7 HOROWITZ, DAVID 2015-07-06 2700 SA17A.123258.5
## 8 HOROWITZ, MICHELLE 2015-07-06 5400 SA17A.123263
## # ... with 1 more variables: election_tp <fctr>
I looked at some examples of contributions that were above $2700 and came across David and Michelle Horowitz. They appear to be a couple who donated to Scott Walker’s campaign.
Summing up contb_receipt_amt, we get $10,800. Is this an instance of people contributing over the limit?
From what I can tell, this instead seems to be double-counting! The FEC provides an Individual Contributor Search, which lets us look at each contributor record in more detail.
From there, I was able to piece together this story:
If this story is true, then these donations are within the campaign contribution limits for primary and general elections. From an election integrity standpoint, this is good.
However, when analyzing this data, we should be aware of this discrepancy. A contribution like Michelle Horowitz’s reattributed $5,400 may be double-counted. Also, a large contribution like David Horowitz’s $10,800 still counts toward the mean, even though it is later reattributed into smaller contributions.
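One way to see the effect of reattribution is to net the amounts per transaction and per contributor. The sketch below is in Python rather than the R used for the analysis, and it uses only the eight rows printed in the tibble above. It assumes, based purely on the pattern in that table, that a `tran_id` such as `SA17A.123258.3` is a follow-up line of `SA17A.123258`; the helper name `root_id` is hypothetical.

```python
from collections import defaultdict

# Rows copied from the tibble above: (contbr_nm, contb_receipt_amt, tran_id).
rows = [
    ("HOROWITZ, DAVID", 10800, "SA17A.123258"),
    ("HOROWITZ, DAVID", -5400, "SA17A.123258.0"),
    ("HOROWITZ, MICHELLE", 5400, "SA17A.123258.1"),
    ("HOROWITZ, MICHELLE", -2700, "SA17A.123258.2"),
    ("HOROWITZ, MICHELLE", 2700, "SA17A.123258.3"),
    ("HOROWITZ, DAVID", -2700, "SA17A.123258.4"),
    ("HOROWITZ, DAVID", 2700, "SA17A.123258.5"),
    ("HOROWITZ, MICHELLE", 5400, "SA17A.123263"),
]


def root_id(tran_id):
    # Assumption: the first two dot-separated parts identify the original
    # transaction, and any third part marks a follow-up (reattribution) line.
    return ".".join(tran_id.split(".")[:2])


net_by_root = defaultdict(int)
net_by_person = defaultdict(int)
for name, amount, tran_id in rows:
    net_by_root[root_id(tran_id)] += amount
    net_by_person[name] += amount

for root, total in net_by_root.items():
    print(root, total)
for name, total in net_by_person.items():
    print(name, total)
```

Under that assumption, the rows sharing the `SA17A.123258` root net to $10,800, while the separate `SA17A.123263` row adds another $5,400 under Michelle Horowitz’s name. That extra row is the one that looks like a candidate for double-counting.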
Democrats had the most contributions by far, which makes sense in California.
As you can see, there are contributions from outside of California.
## # A tibble: 139 × 5
## state zip city contbr_city contbr_st
## <chr> <chr> <chr> <chr> <chr>
## 1 NV 89411 Genoa GENOA CA
## 2 HI 96743 Kamuela SANTA YNEZ CA
## 3 WY 82717 Gillette GILLETTE CA
## 4 UT 84096 Herriman HERRIMAN CA
## 5 AP 96349 Fpo FPO AP CA
## 6 AP 96260 Apo APO CA
## 7 HI 96737 Ocean View BLUE JAY CA
## 8 OR 97209 Portland PORTLAND CA
## 9 NV 89052 Henderson HENDERSON CA
## 10 HI 96743 Kamuela SANTA YNEZ CA
## # ... with 129 more rows
Let’s restrict our visualization to known California zipcodes:
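As a rough illustration of this filter (in Python rather than R), the sketch below keeps only rows whose ZIP-derived `state` is `CA`. The sample rows are taken from the tibble above, plus one in-state row from the `str()` output; the helper name `in_california` is hypothetical and this is not necessarily how the original filtering was written.

```python
# Sample rows: `state`, `zip`, and `city` come from the zipcode lookup,
# while `contbr_city` and `contbr_st` are what the contributor reported.
rows = [
    {"state": "NV", "zip": "89411", "city": "Genoa",
     "contbr_city": "GENOA", "contbr_st": "CA"},
    {"state": "HI", "zip": "96743", "city": "Kamuela",
     "contbr_city": "SANTA YNEZ", "contbr_st": "CA"},
    {"state": "AP", "zip": "96349", "city": "Fpo",
     "contbr_city": "FPO AP", "contbr_st": "CA"},
    {"state": "CA", "zip": "94939", "city": "Larkspur",
     "contbr_city": "LARKSPUR", "contbr_st": "CA"},
]


def in_california(row):
    # Trust the state implied by the ZIP code, not the self-reported one.
    return row["state"] == "CA"


california = [r for r in rows if in_california(r)]
mismatched = [r for r in rows if r["state"] != r["contbr_st"]]

print(len(california), "row(s) kept;", len(mismatched), "row(s) with a state mismatch")
```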
It seems like most of the contributions are centered around the major cities in California: Los Angeles, San Francisco, San Diego, and Sacramento.
Above, we look at the top 10 occupations and employers in our dataset.
As we can see, retirees make up a large chunk of our dataset, as do the self-employed.
There are 1,073,271 records in the dataset with 18 features.
The features are as follows:
Factors: Candidate name, election type (Primary 2016, General 2016, or Primary 2020)
Other observations:
I’m mostly interested in looking at patterns/differences in contributions among different candidates. So for me, the main features of interest are candidate name, contribution amount, date, and location.
Zipcode, employers, occupation, and party might provide other angles into the data.
I created a variable to represent each candidate’s political party.
In order to get geospatial information, I merged in data from the zipcode dataset, using the five-digit zipcode derived from contbr_zip as the key. This merged in latitude and longitude information.
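The `str()` output at the top shows `contbr_zip` stored as an integer that is often nine digits long (ZIP+4), next to a five-character `zip` column. A minimal Python sketch of that kind of key derivation and left join follows. The function `zip5`, the `zip_lookup` table, and the rounded coordinates (copied from the `str()` display) are illustrative assumptions, not the actual zipcode dataset.

```python
def zip5(contbr_zip):
    """Return the five-digit ZIP as a string from a 5- or 9-digit integer."""
    digits = str(contbr_zip)
    # Integers drop leading zeros, so pad back to the intended width first.
    width = 9 if len(digits) > 5 else 5
    return digits.zfill(width)[:5]


# Hypothetical stand-in for the zipcode dataset (coordinates are rounded).
zip_lookup = {
    "94939": {"city": "Larkspur", "state": "CA", "latitude": 37.9, "longitude": -123.0},
    "93428": {"city": "Cambria", "state": "CA", "latitude": 35.6, "longitude": -121.0},
}

contributions = [
    {"contbr_city": "LARKSPUR", "contbr_zip": 949391913},
    {"contbr_city": "CAMBRIA", "contbr_zip": 934284638},
    {"contbr_city": "FONTANA", "contbr_zip": 923371507},  # not in the toy lookup
]

merged = []
for row in contributions:
    key = zip5(row["contbr_zip"])
    geo = zip_lookup.get(key, {})  # left join: keep the row even without a match
    merged.append({
        **row,
        "zip": key,
        "city": geo.get("city"),
        "state": geo.get("state"),
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
    })

for row in merged:
    print(row["zip"], row["city"], row["latitude"], row["longitude"])
```

Rows without a match keep `None` for the geographic columns, which makes unmatched ZIP codes easy to count afterwards.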
Finally, I was curious how donations correlated with votes, so I added in the primary vote totals and delegates, which I found on Wikipedia.
When histogramming the contribution amounts, I used a log scale since one of the bins was really large. This made it easier to see the rest of the data.
## contb_receipt_amt
## Min. :-5700.00
## 1st Qu.: 15.00
## Median : 27.00
## Mean : 50.57
## 3rd Qu.: 50.00
## Max. :10000.00
## contb_receipt_amt
## Min. :-5400.0
## 1st Qu.: 15.0
## Median : 25.0
## Mean : 146.3
## 3rd Qu.: 100.0
## Max. : 7300.0
## NA's :1
Since Sanders is often portrayed as the more progressive, blue-collar candidate, it is interesting that the two candidates’ median donations are roughly the same, and that Clinton’s is actually slightly lower ($25 vs. $27 for Sanders). The means differ much more ($146.30 for Clinton vs. $50.57 for Sanders), which suggests that Clinton received more large donations. Of course, this data does not include donations to Political Action Committees, so that could be a factor.
For the box plots, I sorted the candidates from highest number of donations to lowest.
In the first box plot, we can see that some candidates actually have many donations above the individual limit of $2700. Many also have negative donations, which could either be refunds or reattributions.
In the second box plot, I excluded negative contributions to see if we could see any other patterns.
Observations:
Last donation date could be a proxy variable for how long a campaign lasts. As we can see, this is positively correlated with the total number of donations.
Because campaigns can still receive donations after the campaign has been “suspended”, last donation date by itself isn’t a good indicator of when a campaign ends. For that, it’s better to look at a histogram of dates.
##
## Pearson's product-moment correlation
##
## data: total and n
## t = 12.959, df = 23, p-value = 4.697e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8622126 0.9725650
## sample estimates:
## cor
## 0.9378353
Here we look at a log-log scatter plot of total amount raised vs. number of donations. There is a clear positive correlation (Pearson correlation of 0.94). This makes sense intuitively, especially when you consider that individual campaign contributions are capped at $2,700, so raising a lot of money requires a large number of donations.
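The test statistic and confidence interval in the output above can be reproduced from the reported correlation and degrees of freedom alone. The Python sketch below uses the textbook formulas for a Pearson correlation test: a t statistic with n − 2 degrees of freedom, and a Fisher z-transform interval. It takes `r` and `df` from the printed output; no underlying data is needed or assumed.

```python
import math
from statistics import NormalDist

r = 0.9378353  # sample correlation reported in the output above
df = 23        # degrees of freedom reported above
n = df + 2     # number of candidates, since df = n - 2

# t statistic for testing that the true correlation is zero.
t_stat = r * math.sqrt(df / (1 - r ** 2))

# 95% confidence interval via the Fisher z-transform.
z = math.atanh(r)
half_width = NormalDist().inv_cdf(0.975) / math.sqrt(n - 3)
ci_low = math.tanh(z - half_width)
ci_high = math.tanh(z + half_width)

print(f"t = {t_stat:.3f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

With only 25 candidates the interval is fairly wide, so the exact value 0.94 should not be over-interpreted, even though the correlation is clearly strong.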
Only a few candidates will actually receive any delegates, so it’s hard to make claims about the relationship between number of donations and delegate count. Perhaps it’s exponential.
This is on a log-log scale. There seems to be a positive correlation here.
This visualization again shows donation activity concentrated around cities, and it also shows that Democrats dominate.
The top chart shows the top ten occupations for Democrats. The bottom chart shows the top ten occupations for Republicans.
Appears on both top ten lists:
Just Democrats:
Just Republicans:
Nothing interesting emerged when comparing party against employer.
Here I looked at the top four candidates in the primaries: Clinton, Sanders, Trump, and Cruz. The charts appear in that order from left to right.
Each chart shows the top 10 occupations.
One thing that stood out to me is the percentage of donations that came from retirees. For Clinton, Trump, and Cruz, retirees make up more than 60% of donations. For Sanders, this is less than 10%. Instead, 60% of his donors are listed as “Not Employed.”
Number of contributions and total contributions are positively correlated.
Both of these features are positively correlated with number of votes and number of delegates in the primary election.
Tracking the donations through time can give us a sense of the story of the campaign.
For Democrats, the number of donations seems to roughly track what is going on in the campaign. Sanders gains a lot of interest throughout the campaign, peaks, and then declines as it becomes clearer he will not win the nomination. Clinton’s donations rise after the convention and throughout
Each point on the map represents a zipcode.
The first chart shows the number of donations. The second chart shows the total amount donated.
These two charts look similar, but when we compare them, it seems like Democratic money is more tightly concentrated around cities in the second chart than the first.
I tried faceting by party to see if anything stood out, but I don’t think this set of visualizations showed much more than the previous set.
Similar to the differences between Republicans and Democrats, Clinton’s support draws heavily from urban areas. Trump’s support appears to be more evenly split.
Zooming in on the Bay Area, it looks like Democrats get more of their donations from urban areas than Republicans do.
Observations:
zipcode dataset and that from map_data.
The Democratic and Republican parties receive donations from similar areas, with the Democrats receiving more donations from more densely populated areas.
I expected to see some geographic difference between Clinton’s and Sanders’s support, but they were largely the same.
Each point on the map represents a zipcode. The size of the point represents the number of donations for that zipcode. The color of the point represents the political party.
Over one million donations are visualized on this map. It’s clear that both parties draw support from more populated areas, but the Democrats especially draw support from urban cities.
This chart shows the battle between Hillary Clinton and Bernie Sanders using the number of donations.
We can see that Sanders’s donations peak around late March and early April of 2016, when he won nine out of ten contests over Clinton.
Donations decline as June 7th approaches, when Clinton clinches the nomination.
The Democratic National Convention ran from July 25th to July 28th, 2016, which is where we see Clinton’s donation types switch from Primary to General.
This chart shows the Top 10 Occupations for each Candidate. I chose Clinton, Sanders, Trump, and Cruz because they were the top two candidates for their respective primaries.
Retirees make up the bulk of donations for each candidate except for Sanders, who drew a lot of his support from those listed as “Not Employed.”
This data set contains information on more than 1 million donations to the 2016 Presidential election campaigns in California. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots.
Visualizations of the data helped me spot problems in the data to fix. For instance, I needed to merge variants of the phrase “Self Employed” (like “Self-Employed” or “Self”).
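A minimal Python sketch of this kind of cleanup is shown below. It collapses the three variants mentioned above into one label before counting. The canonical label `SELF-EMPLOYED`, the variant set, and the function name `normalize_occupation` are assumptions made for illustration; the original cleaning was presumably done in R and may have covered more variants.

```python
from collections import Counter

# Variants named in the text, compared after upper-casing and with hyphens
# treated as spaces. Extend this set as more spellings turn up.
SELF_EMPLOYED_VARIANTS = {"SELF", "SELF EMPLOYED"}


def normalize_occupation(raw):
    """Map spelling variants of self-employment onto a single label."""
    if raw is None:
        return None
    label = " ".join(raw.upper().split())            # trim and collapse whitespace
    key = " ".join(label.replace("-", " ").split())  # hyphen-insensitive match key
    return "SELF-EMPLOYED" if key in SELF_EMPLOYED_VARIANTS else label


occupations = ["Self Employed", "Self-Employed", "Self", "RETIRED", "SOFTWARE ENGINEER"]
counts = Counter(normalize_occupation(o) for o in occupations)
print(counts.most_common())
```

Matching on a separate hyphen-insensitive key leaves other hyphenated occupations untouched, so only the self-employment variants are rewritten.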
Given that California is a solidly Democratic state, it was hard to tease out big differences between Republican and Democratic donations. I chose California because of my familiarity with the state, but looking at a swing state like Ohio or Pennsylvania may yield more interesting results.
Finally, I wish there were more information about each of the donors so I could do more analysis. It’s interesting that Sanders has a large number of unemployed supporters, but the current dataset does not provide much information about them. For instance, I would like to understand the distribution of ages in this group (e.g., are they students?). The dataset as is doesn’t make it easy to dig into these sorts of questions.